In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/new-york-city-buildings-energy-consumption-survey/Dockerfile
/kaggle/input/new-york-city-buildings-energy-consumption-survey/LICENSE
/kaggle/input/new-york-city-buildings-energy-consumption-survey/.gitignore
/kaggle/input/new-york-city-buildings-energy-consumption-survey/README.md
/kaggle/input/new-york-city-buildings-energy-consumption-survey/requirements.txt
/kaggle/input/new-york-city-buildings-energy-consumption-survey/model/app.yaml
/kaggle/input/new-york-city-buildings-energy-consumption-survey/model/README.md
/kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/conda.yaml
/kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/MLmodel
/kaggle/input/new-york-city-buildings-energy-consumption-survey/model/XGBModel/model.xgb
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings3.ipynb
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings2.ipynb
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/GreenBuildings1.ipynb
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/GetNeighborhoodNames.py
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/CleaningFunctions.py
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/utilities/PlottingFunctions.py
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/DockerRun.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/LinearModel2.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/building_type.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/pushtogcr.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/GCR.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Energy_Star.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/AppEngine.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostServe1_P2.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Correlations.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/DockerImageSize.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/bigquery.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Compare.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/Success.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostRun.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostGrid.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/LinearModel1.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/deploy.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/iforest.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/EmptyMLFlow.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/notebooks/images/XGBoostServe1.png
/kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc-zip-code-tabulation-areas-polygons.geojson
/kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx
/kaggle/input/new-york-city-buildings-energy-consumption-survey/data/neighborhoods.pkl

Importing the Data set¶

In [2]:
df = pd.read_excel('/kaggle/input/new-york-city-buildings-energy-consumption-survey/data/nyc_benchmarking_disclosure_data_reported_in_2016.xlsx')

Preliminary Analysis¶

In [3]:
df.shape
Out[3]:
(13223, 57)
In [4]:
df.head(5)
Out[4]:
Record Number Order NYC Borough, Block and Lot (BBL) Co-reported BBL Status BBLs Co-reported Reported NYC Building Identification Numbers (BINs) Property Name Parent Property Id Parent Property Name Street Number ... Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Water Use (All Water Sources) (kgal) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Release Date DEP Provided Water Use (kgal) Automatic Water Benchmarking Eligible Reported Water Method
0 4113711.0 1 1.003360e+09 NaN NaN 1088884 16 Bialystoker Place Not Applicable: Standalone Property Not Applicable: Standalone Property 15.0 ... 290.2 146.1 98052.0 98000.0 NaN NaN 2016-04-13 11:08:48 NaN NaN NaN
1 4630135.0 2 1.010618e+09 NaN NaN 1026728 (7478) - 432 West 52nd Street Condominium Not Applicable: Standalone Property Not Applicable: Standalone Property 432.0 ... 76.4 125.0 58851.0 57401.0 768.5 13.39 2016-04-26 21:52:05 NaN Not Eligible NaN
2 4626594.0 3 1.014270e+09 NaN NaN 1043974 (9250) - 250 East 73rd Street Not Applicable: Standalone Property Not Applicable: Standalone Property 1393.0 ... 145.5 392.9 166432.0 166433.0 11901.5 71.51 2016-04-26 22:56:36 11897.0 Eligible Manual
3 4626608.0 4 1.015180e+09 NaN NaN 1048117 (9141) - 141 East 89th Street Not Applicable: Standalone Property Not Applicable: Standalone Property 1341.0 ... 304.6 394.7 114939.0 114939.0 7455.1 64.86 2016-04-26 22:38:03 7455.1 Eligible ABS
4 5048763.0 5 1.003920e+09 NaN NaN 1079685 605 East 9th ST Not Applicable: Standalone Property Not Applicable: Standalone Property 350.0 ... 0.0 0.0 93420.0 111000.0 NaN NaN 2016-07-28 17:28:14 NaN NaN NaN

5 rows × 57 columns

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13223 entries, 0 to 13222
Data columns (total 57 columns):
 #   Column                                                           Non-Null Count  Dtype         
---  ------                                                           --------------  -----         
 0   Record Number                                                    11531 non-null  float64       
 1   Order                                                            13223 non-null  int64         
 2   NYC Borough, Block and Lot (BBL)                                 13222 non-null  float64       
 3   Co-reported BBL Status                                           615 non-null    object        
 4   BBLs Co-reported                                                 615 non-null    object        
 5   Reported NYC Building Identification Numbers (BINs)              11978 non-null  object        
 6   Property Name                                                    11531 non-null  object        
 7   Parent Property Id                                               11531 non-null  object        
 8   Parent Property Name                                             11531 non-null  object        
 9   Street Number                                                    10380 non-null  float64       
 10  Street Name                                                      13223 non-null  object        
 11  Zip Code                                                         12627 non-null  float64       
 12  Borough                                                          13223 non-null  object        
 13  DOF Benchmarking Submission Status                               13223 non-null  object        
 14  Primary Property Type - Self Selected                            11978 non-null  object        
 15  List of All Property Use Types at Property                       11978 non-null  object        
 16  Largest Property Use Type                                        11978 non-null  object        
 17  Largest Property Use Type - Gross Floor Area (ft²)               11527 non-null  float64       
 18  2nd Largest Property Use Type                                    11978 non-null  object        
 19  2nd Largest Property Use - Gross Floor Area (ft²)                11978 non-null  object        
 20  3rd Largest Property Use Type                                    11978 non-null  object        
 21  3rd Largest Property Use Type - Gross Floor Area (ft²)           11978 non-null  object        
 22  Year Built                                                       11978 non-null  object        
 23  Number of Buildings - Self-reported                              11526 non-null  float64       
 24  Occupancy                                                        11531 non-null  float64       
 25  Metered Areas (Energy)                                           11978 non-null  object        
 26  Metered Areas  (Water)                                           11978 non-null  object        
 27  ENERGY STAR Score                                                9535 non-null   float64       
 28  Site EUI (kBtu/ft²)                                              11439 non-null  float64       
 29  Weather Normalized Site EUI (kBtu/ft²)                           9848 non-null   float64       
 30  Weather Normalized Site Electricity Intensity (kWh/ft²)          11095 non-null  float64       
 31  Weather Normalized Site Natural Gas Intensity (therms/ft²)       9463 non-null   float64       
 32  Source EUI (kBtu/ft²)                                            11439 non-null  float64       
 33  Weather Normalized Source EUI (kBtu/ft²)                         9848 non-null   float64       
 34  Fuel Oil #1 Use (kBtu)                                           11978 non-null  object        
 35  Fuel Oil #2 Use (kBtu)                                           11978 non-null  object        
 36  Fuel Oil #4 Use (kBtu)                                           11978 non-null  object        
 37  Fuel Oil #5 & 6 Use (kBtu)                                       11978 non-null  object        
 38  Diesel #2 Use (kBtu)                                             11978 non-null  object        
 39  District Steam Use (kBtu)                                        11978 non-null  object        
 40  District Hot Water Use (kBtu)                                    11978 non-null  object        
 41  District Chilled Water Use (kBtu)                                11978 non-null  object        
 42  Natural Gas Use (kBtu)                                           10087 non-null  float64       
 43  Weather Normalized Site Natural Gas Use (therms)                 9465 non-null   float64       
 44  Electricity Use - Grid Purchase (kBtu)                           11425 non-null  float64       
 45  Weather Normalized Site Electricity (kWh)                        11097 non-null  float64       
 46  Total GHG Emissions (Metric Tons CO2e)                           11478 non-null  float64       
 47  Direct GHG Emissions (Metric Tons CO2e)                          11471 non-null  float64       
 48  Indirect GHG Emissions (Metric Tons CO2e)                        11495 non-null  float64       
 49  DOF Property Floor Area (ft²)                                    12627 non-null  float64       
 50  Property GFA - Self-reported (ft²)                               11531 non-null  float64       
 51  Water Use (All Water Sources) (kgal)                             7265 non-null   float64       
 52  Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)  6391 non-null   float64       
 53  Release Date                                                     11531 non-null  datetime64[ns]
 54  DEP Provided Water Use (kgal)                                    5247 non-null   float64       
 55  Automatic Water Benchmarking Eligible                            7265 non-null   object        
 56  Reported Water Method                                            5247 non-null   object        
dtypes: datetime64[ns](1), float64(26), int64(1), object(29)
memory usage: 5.8+ MB
In [6]:
df.describe()
Out[6]:
Record Number Order NYC Borough, Block and Lot (BBL) Street Number Zip Code Largest Property Use Type - Gross Floor Area (ft²) Number of Buildings - Self-reported Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) ... Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Water Use (All Water Sources) (kgal) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Release Date DEP Provided Water Use (kgal)
count 1.153100e+04 13223.00000 1.322200e+04 10380.000000 12627.000000 1.152700e+04 11526.000000 11531.000000 9535.000000 11439.000000 ... 1.109700e+04 1.147800e+04 1.147100e+04 1.149500e+04 1.262700e+04 1.153100e+04 7.265000e+03 6391.000000 11531 5247.000000
mean 3.440970e+06 6612.00000 2.198121e+09 996.288247 10569.869011 1.641104e+05 1.387819 98.987512 57.735711 525.733377 ... 2.380525e+06 6.952577e+03 1.340512e+03 5.605184e+03 1.636773e+05 1.709439e+05 2.579751e+04 126.303422 2016-05-15 11:24:30.392767488 6478.597370
min 7.365000e+03 1.00000 1.000048e+09 0.000000 0.000000 5.800000e+01 0.000000 0.000000 1.000000 0.000000 ... 0.000000e+00 0.000000e+00 0.000000e+00 -2.335860e+04 5.000800e+04 0.000000e+00 0.000000e+00 0.000000 2016-02-19 12:08:50 1.000000
25% 2.734774e+06 3306.50000 1.013110e+09 146.000000 10024.000000 6.536950e+04 1.000000 100.000000 34.000000 65.300000 ... 3.026190e+05 3.420250e+02 1.715000e+02 8.630000e+01 6.500000e+04 6.717600e+04 2.661700e+03 28.240000 2016-04-21 10:03:55 2876.300000
50% 3.111728e+06 6612.00000 2.029920e+09 441.000000 10456.000000 9.226600e+04 1.000000 100.000000 63.000000 82.400000 ... 5.454674e+05 5.198000e+02 3.051000e+02 1.580000e+02 9.240000e+04 9.491000e+04 4.745600e+03 45.700000 2016-04-29 09:31:42 4799.800000
75% 4.371063e+06 9917.50000 3.061218e+09 1300.250000 11222.000000 1.557285e+05 1.000000 100.000000 83.000000 103.000000 ... 1.311119e+06 9.394500e+02 5.050000e+02 4.067000e+02 1.563555e+05 1.615295e+05 8.057900e+03 71.680000 2016-05-31 13:43:25 7732.800000
max 5.053345e+06 13223.00000 5.080080e+09 76179.000000 11694.000000 6.940450e+06 161.000000 100.000000 100.000000 801504.700000 ... 3.046183e+09 1.501468e+07 5.852949e+06 1.501455e+07 1.343508e+07 6.940450e+06 4.385740e+07 73011.090000 2016-08-01 23:04:11 129781.600000
std 8.500920e+05 3817.29564 1.218340e+09 1550.762447 645.826741 2.569770e+05 3.631729 6.941736 30.143817 10120.105154 ... 3.576546e+07 1.692231e+05 5.949074e+04 1.583246e+05 2.993519e+05 2.680820e+05 5.860239e+05 1483.749849 NaN 7090.965112

8 rows × 28 columns

In [7]:
df.describe(include='object')
Out[7]:
Co-reported BBL Status BBLs Co-reported Reported NYC Building Identification Numbers (BINs) Property Name Parent Property Id Parent Property Name Street Name Borough DOF Benchmarking Submission Status Primary Property Type - Self Selected ... Fuel Oil #1 Use (kBtu) Fuel Oil #2 Use (kBtu) Fuel Oil #4 Use (kBtu) Fuel Oil #5 & 6 Use (kBtu) Diesel #2 Use (kBtu) District Steam Use (kBtu) District Hot Water Use (kBtu) District Chilled Water Use (kBtu) Automatic Water Benchmarking Eligible Reported Water Method
count 615 615 11978 11531 11531 11531 13223 13223 13223 11978 ... 11978 11978 11978 11978 11978 11978 11978 11978 7265 5247
unique 3 169 11385 11522 107 107 2274 8 3 53 ... 8 2291 1319 464 17 927 2 12 2 2
top Secondary 3017940054;3017940119;3017940121;3017940139;30... See Primary BBL Club Quarters Not Applicable: Standalone Property Not Applicable: Standalone Property Not Available Manhattan In Compliance Multifamily Housing ... Not Available Not Available Not Available Not Available Not Available Not Available Not Available Not Available Eligible ABS
freq 410 48 447 3 11330 11330 596 5694 11762 8699 ... 11525 8845 10108 10732 11510 10599 11531 11521 6317 5056

4 rows × 29 columns

In [8]:
df.isna().sum()
Out[8]:
Record Number                                                       1692
Order                                                                  0
NYC Borough, Block and Lot (BBL)                                       1
Co-reported BBL Status                                             12608
BBLs Co-reported                                                   12608
Reported NYC Building Identification Numbers (BINs)                 1245
Property Name                                                       1692
Parent Property Id                                                  1692
Parent Property Name                                                1692
Street Number                                                       2843
Street Name                                                            0
Zip Code                                                             596
Borough                                                                0
DOF Benchmarking Submission Status                                     0
Primary Property Type - Self Selected                               1245
List of All Property Use Types at Property                          1245
Largest Property Use Type                                           1245
Largest Property Use Type - Gross Floor Area (ft²)                  1696
2nd Largest Property Use Type                                       1245
2nd Largest Property Use - Gross Floor Area (ft²)                   1245
3rd Largest Property Use Type                                       1245
3rd Largest Property Use Type - Gross Floor Area (ft²)              1245
Year Built                                                          1245
Number of Buildings - Self-reported                                 1697
Occupancy                                                           1692
Metered Areas (Energy)                                              1245
Metered Areas  (Water)                                              1245
ENERGY STAR Score                                                   3688
Site EUI (kBtu/ft²)                                                 1784
Weather Normalized Site EUI (kBtu/ft²)                              3375
Weather Normalized Site Electricity Intensity (kWh/ft²)             2128
Weather Normalized Site Natural Gas Intensity (therms/ft²)          3760
Source EUI (kBtu/ft²)                                               1784
Weather Normalized Source EUI (kBtu/ft²)                            3375
Fuel Oil #1 Use (kBtu)                                              1245
Fuel Oil #2 Use (kBtu)                                              1245
Fuel Oil #4 Use (kBtu)                                              1245
Fuel Oil #5 & 6 Use (kBtu)                                          1245
Diesel #2 Use (kBtu)                                                1245
District Steam Use (kBtu)                                           1245
District Hot Water Use (kBtu)                                       1245
District Chilled Water Use (kBtu)                                   1245
Natural Gas Use (kBtu)                                              3136
Weather Normalized Site Natural Gas Use (therms)                    3758
Electricity Use - Grid Purchase (kBtu)                              1798
Weather Normalized Site Electricity (kWh)                           2126
Total GHG Emissions (Metric Tons CO2e)                              1745
Direct GHG Emissions (Metric Tons CO2e)                             1752
Indirect GHG Emissions (Metric Tons CO2e)                           1728
DOF Property Floor Area (ft²)                                        596
Property GFA - Self-reported (ft²)                                  1692
Water Use (All Water Sources) (kgal)                                5958
Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)     6832
Release Date                                                        1692
DEP Provided Water Use (kgal)                                       7976
Automatic Water Benchmarking Eligible                               5958
Reported Water Method                                               7976
dtype: int64
In [9]:
df.duplicated().sum()
Out[9]:
0

Removing unnecessary columns¶

In [10]:
df1 = df.drop(columns = ['Reported NYC Building Identification Numbers (BINs)', 'Record Number', 'Order', 'Street Number', 'Release Date', 'Co-reported BBL Status', 'BBLs Co-reported', 'Parent Property Id', 'Parent Property Name', 'List of All Property Use Types at Property', 'Reported Water Method', 'Number of Buildings - Self-reported', 'Occupancy'])
In [11]:
df1.sample(5)
Out[11]:
NYC Borough, Block and Lot (BBL) Property Name Street Name Zip Code Borough DOF Benchmarking Submission Status Primary Property Type - Self Selected Largest Property Use Type Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use Type ... Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Water Use (All Water Sources) (kgal) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) DEP Provided Water Use (kgal) Automatic Water Benchmarking Eligible
8688 3.051220e+09 Cascade: 229 East 18th Street EAST 18 STREET 11226.0 Brooklyn In Compliance Multifamily Housing Multifamily Housing 67984.0 Not Available ... 200096.0 320.4 263.8 56.6 58272.0 67984.0 5538.9 81.47 5538.9 Eligible
12446 2.028140e+09 NaN ANTHONY AVENUE 10457.0 Bronx In Violation NaN NaN NaN NaN ... NaN NaN NaN NaN 56400.0 NaN NaN NaN NaN NaN
5794 2.027220e+09 887 Southern Blvd SOUTHERN BOULEVARD 10459.0 Bronx In Compliance Multifamily Housing Multifamily Housing 102166.0 Outpatient Rehabilitation/Physical Therapy ... 505340.4 437.4 294.4 143.0 102166.0 114517.0 6115.3 NaN NaN Eligible
201 1.001348e+09 25 Murray Street MURRAY STREET 10007.0 Manhattan In Compliance Multifamily Housing Multifamily Housing 144330.0 Not Available ... 917247.4 523.3 261.0 262.2 144330.0 144330.0 NaN NaN NaN NaN
6651 2.033420e+09 ROSENBERG: 3162 Bainbridge Ave BAINBRIDGE AVENUE 10467.0 Bronx In Compliance Multifamily Housing Multifamily Housing 65320.0 Retail Store ... 286224.0 417.9 336.9 81.0 71000.0 66515.0 NaN NaN NaN NaN

5 rows × 44 columns

In [12]:
df1.shape
Out[12]:
(13223, 44)
In [13]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13223 entries, 0 to 13222
Data columns (total 44 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   NYC Borough, Block and Lot (BBL)                                 13222 non-null  float64
 1   Property Name                                                    11531 non-null  object 
 2   Street Name                                                      13223 non-null  object 
 3   Zip Code                                                         12627 non-null  float64
 4   Borough                                                          13223 non-null  object 
 5   DOF Benchmarking Submission Status                               13223 non-null  object 
 6   Primary Property Type - Self Selected                            11978 non-null  object 
 7   Largest Property Use Type                                        11978 non-null  object 
 8   Largest Property Use Type - Gross Floor Area (ft²)               11527 non-null  float64
 9   2nd Largest Property Use Type                                    11978 non-null  object 
 10  2nd Largest Property Use - Gross Floor Area (ft²)                11978 non-null  object 
 11  3rd Largest Property Use Type                                    11978 non-null  object 
 12  3rd Largest Property Use Type - Gross Floor Area (ft²)           11978 non-null  object 
 13  Year Built                                                       11978 non-null  object 
 14  Metered Areas (Energy)                                           11978 non-null  object 
 15  Metered Areas  (Water)                                           11978 non-null  object 
 16  ENERGY STAR Score                                                9535 non-null   float64
 17  Site EUI (kBtu/ft²)                                              11439 non-null  float64
 18  Weather Normalized Site EUI (kBtu/ft²)                           9848 non-null   float64
 19  Weather Normalized Site Electricity Intensity (kWh/ft²)          11095 non-null  float64
 20  Weather Normalized Site Natural Gas Intensity (therms/ft²)       9463 non-null   float64
 21  Source EUI (kBtu/ft²)                                            11439 non-null  float64
 22  Weather Normalized Source EUI (kBtu/ft²)                         9848 non-null   float64
 23  Fuel Oil #1 Use (kBtu)                                           11978 non-null  object 
 24  Fuel Oil #2 Use (kBtu)                                           11978 non-null  object 
 25  Fuel Oil #4 Use (kBtu)                                           11978 non-null  object 
 26  Fuel Oil #5 & 6 Use (kBtu)                                       11978 non-null  object 
 27  Diesel #2 Use (kBtu)                                             11978 non-null  object 
 28  District Steam Use (kBtu)                                        11978 non-null  object 
 29  District Hot Water Use (kBtu)                                    11978 non-null  object 
 30  District Chilled Water Use (kBtu)                                11978 non-null  object 
 31  Natural Gas Use (kBtu)                                           10087 non-null  float64
 32  Weather Normalized Site Natural Gas Use (therms)                 9465 non-null   float64
 33  Electricity Use - Grid Purchase (kBtu)                           11425 non-null  float64
 34  Weather Normalized Site Electricity (kWh)                        11097 non-null  float64
 35  Total GHG Emissions (Metric Tons CO2e)                           11478 non-null  float64
 36  Direct GHG Emissions (Metric Tons CO2e)                          11471 non-null  float64
 37  Indirect GHG Emissions (Metric Tons CO2e)                        11495 non-null  float64
 38  DOF Property Floor Area (ft²)                                    12627 non-null  float64
 39  Property GFA - Self-reported (ft²)                               11531 non-null  float64
 40  Water Use (All Water Sources) (kgal)                             7265 non-null   float64
 41  Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)  6391 non-null   float64
 42  DEP Provided Water Use (kgal)                                    5247 non-null   float64
 43  Automatic Water Benchmarking Eligible                            7265 non-null   object 
dtypes: float64(22), object(22)
memory usage: 4.4+ MB
In [14]:
df1.isna().sum()
Out[14]:
NYC Borough, Block and Lot (BBL)                                      1
Property Name                                                      1692
Street Name                                                           0
Zip Code                                                            596
Borough                                                               0
DOF Benchmarking Submission Status                                    0
Primary Property Type - Self Selected                              1245
Largest Property Use Type                                          1245
Largest Property Use Type - Gross Floor Area (ft²)                 1696
2nd Largest Property Use Type                                      1245
2nd Largest Property Use - Gross Floor Area (ft²)                  1245
3rd Largest Property Use Type                                      1245
3rd Largest Property Use Type - Gross Floor Area (ft²)             1245
Year Built                                                         1245
Metered Areas (Energy)                                             1245
Metered Areas  (Water)                                             1245
ENERGY STAR Score                                                  3688
Site EUI (kBtu/ft²)                                                1784
Weather Normalized Site EUI (kBtu/ft²)                             3375
Weather Normalized Site Electricity Intensity (kWh/ft²)            2128
Weather Normalized Site Natural Gas Intensity (therms/ft²)         3760
Source EUI (kBtu/ft²)                                              1784
Weather Normalized Source EUI (kBtu/ft²)                           3375
Fuel Oil #1 Use (kBtu)                                             1245
Fuel Oil #2 Use (kBtu)                                             1245
Fuel Oil #4 Use (kBtu)                                             1245
Fuel Oil #5 & 6 Use (kBtu)                                         1245
Diesel #2 Use (kBtu)                                               1245
District Steam Use (kBtu)                                          1245
District Hot Water Use (kBtu)                                      1245
District Chilled Water Use (kBtu)                                  1245
Natural Gas Use (kBtu)                                             3136
Weather Normalized Site Natural Gas Use (therms)                   3758
Electricity Use - Grid Purchase (kBtu)                             1798
Weather Normalized Site Electricity (kWh)                          2126
Total GHG Emissions (Metric Tons CO2e)                             1745
Direct GHG Emissions (Metric Tons CO2e)                            1752
Indirect GHG Emissions (Metric Tons CO2e)                          1728
DOF Property Floor Area (ft²)                                       596
Property GFA - Self-reported (ft²)                                 1692
Water Use (All Water Sources) (kgal)                               5958
Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)    6832
DEP Provided Water Use (kgal)                                      7976
Automatic Water Benchmarking Eligible                              5958
dtype: int64
In [15]:
df1.nunique()
Out[15]:
NYC Borough, Block and Lot (BBL)                                   13222
Property Name                                                      11522
Street Name                                                         2274
Zip Code                                                             192
Borough                                                                8
DOF Benchmarking Submission Status                                     3
Primary Property Type - Self Selected                                 53
Largest Property Use Type                                             52
Largest Property Use Type - Gross Floor Area (ft²)                  9355
2nd Largest Property Use Type                                         56
2nd Largest Property Use - Gross Floor Area (ft²)                   2231
3rd Largest Property Use Type                                         46
3rd Largest Property Use Type - Gross Floor Area (ft²)               977
Year Built                                                           155
Metered Areas (Energy)                                                10
Metered Areas  (Water)                                                 7
ENERGY STAR Score                                                    100
Site EUI (kBtu/ft²)                                                 1985
Weather Normalized Site EUI (kBtu/ft²)                              1881
Weather Normalized Site Electricity Intensity (kWh/ft²)              432
Weather Normalized Site Natural Gas Intensity (therms/ft²)            58
Source EUI (kBtu/ft²)                                               2892
Weather Normalized Source EUI (kBtu/ft²)                            2750
Fuel Oil #1 Use (kBtu)                                                 8
Fuel Oil #2 Use (kBtu)                                              2291
Fuel Oil #4 Use (kBtu)                                              1319
Fuel Oil #5 & 6 Use (kBtu)                                           464
Diesel #2 Use (kBtu)                                                  17
District Steam Use (kBtu)                                            927
District Hot Water Use (kBtu)                                          2
District Chilled Water Use (kBtu)                                     12
Natural Gas Use (kBtu)                                              9951
Weather Normalized Site Natural Gas Use (therms)                    9322
Electricity Use - Grid Purchase (kBtu)                             11352
Weather Normalized Site Electricity (kWh)                          11042
Total GHG Emissions (Metric Tons CO2e)                              7882
Direct GHG Emissions (Metric Tons CO2e)                             6261
Indirect GHG Emissions (Metric Tons CO2e)                           5645
DOF Property Floor Area (ft²)                                      10103
Property GFA - Self-reported (ft²)                                  9333
Water Use (All Water Sources) (kgal)                                6810
Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)     4788
DEP Provided Water Use (kgal)                                       5009
Automatic Water Benchmarking Eligible                                  2
dtype: int64

Outlier Analysis & Treatment¶

In [16]:
numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
In [17]:
numerical_cols
Out[17]:
Index(['NYC Borough, Block and Lot (BBL)', 'Zip Code',
       'Largest Property Use Type - Gross Floor Area (ft²)',
       'ENERGY STAR Score', 'Site EUI (kBtu/ft²)',
       'Weather Normalized Site EUI (kBtu/ft²)',
       'Weather Normalized Site Electricity Intensity (kWh/ft²)',
       'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
       'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)',
       'Natural Gas Use (kBtu)',
       'Weather Normalized Site Natural Gas Use (therms)',
       'Electricity Use - Grid Purchase (kBtu)',
       'Weather Normalized Site Electricity (kWh)',
       'Total GHG Emissions (Metric Tons CO2e)',
       'Direct GHG Emissions (Metric Tons CO2e)',
       'Indirect GHG Emissions (Metric Tons CO2e)',
       'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)',
       'Water Use (All Water Sources) (kgal)',
       'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       'DEP Provided Water Use (kgal)'],
      dtype='object')
In [18]:
for col in numerical_cols:
    Q1 = df1[col].quantile(0.25)
    Q3 = df1[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df1[col] = df1[col].apply(lambda x: upper_bound if x > upper_bound else (lower_bound if x < lower_bound else x))
In [19]:
df1.describe()
Out[19]:
NYC Borough, Block and Lot (BBL) Zip Code Largest Property Use Type - Gross Floor Area (ft²) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Weather Normalized Site Electricity Intensity (kWh/ft²) Weather Normalized Site Natural Gas Intensity (therms/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) ... Electricity Use - Grid Purchase (kBtu) Weather Normalized Site Electricity (kWh) Total GHG Emissions (Metric Tons CO2e) Direct GHG Emissions (Metric Tons CO2e) Indirect GHG Emissions (Metric Tons CO2e) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Water Use (All Water Sources) (kgal) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) DEP Provided Water Use (kgal)
count 1.322200e+04 12627.000000 11527.000000 9535.000000 11439.000000 9848.000000 11095.000000 9463.000000 11439.000000 9848.000000 ... 1.142500e+04 1.109700e+04 11478.000000 11471.000000 11495.000000 12627.000000 11531.000000 7265.000000 6391.000000 5247.000000
mean 2.198121e+09 10576.384414 123855.494326 57.735711 84.950699 84.794121 6.916192 0.449794 139.486963 139.603887 ... 3.258409e+06 9.459251e+05 722.204369 370.945423 292.008752 124528.037044 128253.021204 5942.993930 52.978140 5784.196598
std 1.218340e+09 576.984717 77981.200377 30.143817 33.794678 32.936387 4.390756 0.363962 57.700609 56.949520 ... 3.041626e+06 8.848482e+05 534.520737 291.261814 285.737844 78066.569027 81411.206767 4500.004504 34.959198 3904.043061
min 1.000048e+09 8227.000000 58.000000 1.000000 8.750000 10.762500 0.000000 0.000000 5.175000 7.612500 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 -394.300000 50008.000000 0.000000 0.000000 0.000000 1.000000
25% 1.013110e+09 10024.000000 65369.500000 34.000000 65.300000 65.700000 3.800000 0.100000 103.200000 103.800000 ... 1.045702e+06 3.026190e+05 342.025000 171.500000 86.300000 65000.000000 67176.000000 2661.700000 28.240000 2876.300000
50% 2.029920e+09 10456.000000 92266.000000 63.000000 82.400000 82.500000 5.300000 0.500000 129.200000 129.200000 ... 1.885996e+06 5.454674e+05 519.800000 305.100000 158.000000 92400.000000 94910.000000 4745.600000 45.700000 4799.800000
75% 3.061218e+09 11222.000000 155728.500000 83.000000 103.000000 102.325000 8.900000 0.700000 168.550000 167.925000 ... 4.513704e+06 1.311119e+06 939.450000 505.000000 406.700000 156355.500000 161529.500000 8057.900000 71.680000 7732.800000
max 5.080080e+09 11694.000000 291267.000000 100.000000 159.550000 157.262500 16.550000 1.600000 266.575000 264.112500 ... 9.715707e+06 2.823868e+06 1835.587500 1005.250000 887.300000 293388.750000 303059.750000 16152.200000 136.840000 15017.550000

8 rows × 22 columns

Encoding the required categorical columns¶

In [20]:
from sklearn.preprocessing import LabelEncoder
In [21]:
one_hot_columns = ['Borough','DOF Benchmarking Submission Status']
label_columns = ['Automatic Water Benchmarking Eligible']
In [22]:
df1_encoded = pd.get_dummies(df1, columns=one_hot_columns, drop_first=True)
In [23]:
label_encoder = LabelEncoder()
for col in label_columns:
    df1_encoded[col] = label_encoder.fit_transform(df1_encoded[col])
In [24]:
df1_encoded.head()
Out[24]:
NYC Borough, Block and Lot (BBL) Property Name Street Name Zip Code Primary Property Type - Self Selected Largest Property Use Type Largest Property Use Type - Gross Floor Area (ft²) 2nd Largest Property Use Type 2nd Largest Property Use - Gross Floor Area (ft²) 3rd Largest Property Use Type ... Automatic Water Benchmarking Eligible Borough_Brooklyn Borough_Brooklyn Borough_Manhattan Borough_Manhattan Borough_Not Available Borough_Queens Borough_Staten Island DOF Benchmarking Submission Status_In Violation DOF Benchmarking Submission Status_Not on CBL
0 1.003360e+09 16 Bialystoker Place BIALYSTOKER PLACE 10002.0 Multifamily Housing Multifamily Housing 98000.0 Not Available Not Available Not Available ... 2 False False True False False False False False False
1 1.010618e+09 (7478) - 432 West 52nd Street Condominium WEST 52 STREET 10019.0 Multifamily Housing Multifamily Housing 57401.0 Not Available Not Available Not Available ... 1 False False True False False False False False False
2 1.014270e+09 (9250) - 250 East 73rd Street 2 AVENUE 10021.0 Multifamily Housing Multifamily Housing 166433.0 Not Available Not Available Not Available ... 0 False False True False False False False False False
3 1.015180e+09 (9141) - 141 East 89th Street LEXINGTON AVENUE 10128.0 Multifamily Housing Multifamily Housing 114939.0 Not Available Not Available Not Available ... 0 False False True False False False False False False
4 1.003920e+09 605 East 9th ST EAST 10 STREET 10009.0 Residence Hall/Dormitory Not Available NaN Not Available Not Available Not Available ... 2 False False True False False False False False False

5 rows × 51 columns

In [25]:
print(df1['Borough'].unique())
['Manhattan' 'Bronx' 'Not Available' 'Brooklyn' 'Brooklyn ' 'Queens'
 'Staten Island' 'Manhattan ']
In [26]:
df1['Borough'] = df1['Borough'].str.strip().str.title()
In [27]:
filtered_df = df1[df1['Borough'] != 'Not Available']
plt.figure(figsize = (15, 10))
sns.boxplot(x = "Borough", y='ENERGY STAR Score', data = filtered_df)
plt.show()
No description has been provided for this image

Manhattan has the widest range and highest median emissions, indicating more properties with significant GHG output, likely due to its density.

In [28]:
filtered_df = filtered_df[(filtered_df['Largest Property Use Type'] != 'See Primary BBL') & (filtered_df['Largest Property Use Type'] != 'Other')]
top10_largest_property_types = filtered_df['Largest Property Use Type'].value_counts().head(10).index
filtered_df1 = filtered_df[filtered_df['Largest Property Use Type'].isin(top10_largest_property_types)]
In [29]:
plt.figure(figsize = (18, 8))
sns.barplot(x = "Largest Property Use Type", y='ENERGY STAR Score', data = filtered_df1, errorbar=None)
plt.title("Largest Property Use Type vs ENERGY STAR Score", fontsize=16)
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

"Office" generally exhibit better energy performance compared to other property types.

In [30]:
filtered_df1['Largest Property Use Type'].value_counts()
Out[30]:
Largest Property Use Type
Multifamily Housing           8569
Office                        1264
Hotel                          237
Non-Refrigerated Warehouse     194
Senior Care Community          111
Residence Hall/Dormitory        99
K-12 School                     93
Retail Store                    90
Self-Storage Facility           87
College/University              63
Name: count, dtype: int64
In [31]:
df1_cor = df1[['Largest Property Use Type - Gross Floor Area (ft²)', 'ENERGY STAR Score', 'Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)',
               'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)', 'Water Use (All Water Sources) (kgal)', 'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', 'DEP Provided Water Use (kgal)']]
correlation_matrix = df1_cor.corr()
In [32]:
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap="viridis", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
No description has been provided for this image

Gross Floor Area and Total GHG Emissions have a strong positive correlation. This suggests that larger buildings tend to produce more greenhouse gas emissions, which makes sense considering their higher energy consumption and operational needs.

In [33]:
import warnings
warnings.filterwarnings("ignore")
In [34]:
borough_counts = filtered_df['Borough'].value_counts().reset_index()
borough_counts.columns = ['Borough', 'Count']

plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Count', data=borough_counts)
plt.title("Count of Boroughs")
plt.xlabel("Borough")
plt.ylabel("Count")
plt.show()
No description has been provided for this image

Energy Use Intensity¶

Borough vs Energy Use Intensity¶

In [35]:
eui_data = filtered_df1.groupby('Borough')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()
eui_data_melted = eui_data.melt(id_vars='Borough', var_name='Metric', value_name='EUI')
plt.figure(figsize=(15, 12))
sns.barplot(x='Borough', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')
plt.title('Borough vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

The Bronx has the highest energy utilization, while Brooklyn exhibits the lowest among all boroughs.

Largest Property Use Type vs Energy Use Intensity¶

In [36]:
eui_data = filtered_df1.groupby('Largest Property Use Type')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()
eui_data_melted = eui_data.melt(id_vars='Largest Property Use Type', var_name='Metric', value_name='EUI')
plt.figure(figsize=(15, 12))
sns.barplot(x='Largest Property Use Type', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')
plt.title('Largest Property Use Type vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Hospitals and senior care facilities have the highest energy intensity, while storage/utility properties exhibit the lowest energy consumption among all property types.

Year Built vs Energy Use Intensity¶

In [37]:
filtered_df1['Year Built'] = pd.to_numeric(filtered_df1['Year Built'], errors='coerce')
filtered_df1 = filtered_df1.dropna(subset=['Year Built'])

filtered_df1['Built Year Group'] = pd.cut(
    filtered_df1['Year Built'], 
    bins=[1800, 1900, 1950, 2000, 2020], 
    labels=['1800-1900', '1900-1950', '1950-2000', '2000-2020']
)
In [38]:
eui_data = filtered_df1.groupby('Built Year Group')[['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)']].mean().reset_index()

eui_data_melted = eui_data.melt(id_vars='Built Year Group', var_name='Metric', value_name='EUI')

plt.figure(figsize=(15, 12))
sns.lineplot(x='Built Year Group', y='EUI', hue='Metric', data=eui_data_melted, palette='viridis')

plt.title('Year Built vs Mean Site EUI and Weather Normalized Site EUI', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Mean EUI (kBtu/ft²)', fontsize=14)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()
No description has been provided for this image

Buildings constructed between 1980-2000 have the highest mean Site EUI and Weather Normalized Site EUI among all built year groups.

Weather Normalized Site Electricity Intensity (kWh/ft²)¶

Borough vs Weather Normalized Site Electricity Intensity (kWh/ft²)¶

In [39]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None,  palette='viridis')
plt.title("Borough vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Manhattan exhibits highest Weather normalised site electrical intensity among others.

Largest Property Use Type vs Weather Normalized Site Electricity Intensity (kWh/ft²)¶

In [40]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None,  palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Largest Property Use Type")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Hotel exhibits highest Weather normalised site electrical intensity among others. Likely due to its business nature.

Year Built vs Normalized Site Electricity Intensity (kWh/ft²)¶

In [41]:
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Electricity Intensity (kWh/ft²)', data=filtered_df1, errorbar=None)
plt.title("Built Year Group vs Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xlabel("Built Year Group")
plt.ylabel("Weather Normalized Site Electricity Intensity (kWh/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

The building constructed between 1900-1950, exhibits the lowest weather normalized electricity intensity.

Weather Normalized Site Natural Gas Intensity (therms/ft²)¶

Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶

In [42]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.tight_layout()
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

Brooklyn has the highest weather normalized site natural gas intensity among the boroughs.

Largest Use Property Type vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶

In [43]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None, palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel('Largest Property Use Type')
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Senior Care Communities have the highest Weather Normalized Site Natural Gas Intensity at approximately 0.8 therms/ft². This could be due to their high energy needs for heating, cooking, and other services required for residents.

Built Year vs Weather Normalized Site Natural Gas Intensity (therms/ft²)¶

In [44]:
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Natural Gas Intensity (therms/ft²)', data=filtered_df1, errorbar=None)
plt.title("Borough vs Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.xlabel("Built Year")
plt.ylabel("Weather Normalized Site Natural Gas Intensity (therms/ft²)")
plt.tight_layout()
plt.show()
No description has been provided for this image

The natural gas intensity incerasing gradully over the period till 2000, after that decreasing likely due to the evalution of electrical usages increasing.

Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶

Borough vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶

In [45]:
energy_data = filtered_df1.groupby('Borough')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Borough', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.barplot(x='Borough', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Borough and Type (kBtu)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

Queens has the highest natural gas usage among the boroughs, likely due to its larger residential area and population density.

Largest Property Use Type vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶

In [46]:
energy_data = filtered_df1.groupby('Largest Property Use Type')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Largest Property Use Type', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.barplot(x='Largest Property Use Type', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Largest Property Use Type (kBtu)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

Senior Care Communities have the highest energy usage in terms of electricity among all property types. This is likely due to the extensive energy requirements for providing consistent heating, cooling, and various services necessary for residents.

Built Year vs Natural Gas Use (kBtu) & Electricity Use - Grid Purchase (kBtu)¶

In [47]:
energy_data = filtered_df1.groupby('Built Year Group')[['Natural Gas Use (kBtu)', 'Electricity Use - Grid Purchase (kBtu)']].mean().reset_index()
energy_data_melted = energy_data.melt(id_vars='Built Year Group', var_name='Energy Type', value_name='Usage (kBtu)')
plt.figure(figsize=(16, 14))
sns.lineplot(x='Built Year Group', y='Usage (kBtu)', hue='Energy Type', data=energy_data_melted, palette='viridis')
plt.title('Energy Usage by Built Year Group (kBtu)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Usage (kBtu)', fontsize=14)
plt.legend(title='Energy Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

The line graph shows that natural gas usage peaked in buildings constructed between 1950-2000, likely due to the boom in construction during that era and the prevalent use of natural gas for heating and other energy needs.

Weather Normalized Site Electricity (kWh)¶

Borough vs Weather Normalized Site Electricity (kWh)¶

In [48]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Borough', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None,  palette='viridis')
plt.title("Borough vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Borough")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

The bar chart reveals that Manhattan has the highest electricity usage (weather normalized site electricity) compared to the other boroughs, likely due to its dense population and high concentration of commercial buildings.

Largest Property Use Type vs Weather Normalized Site Electricity (kWh)¶

In [49]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Largest Property Use Type', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None,  palette='viridis')
plt.title("Largest Property Use Type vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Largest Property Use Type")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Office buildings have the highest weather normalized site electricity usage among property types. This is likely due to their large occupancy, extensive lighting, and HVAC needs.

Built Year Group vs Weather Normalized Site Electricity (kWh)¶

In [50]:
plt.figure(figsize=(15, 10))
sns.lineplot(x='Built Year Group', y='Weather Normalized Site Electricity (kWh)', data=filtered_df1, errorbar=None)
plt.title("Built Year Group vs Weather Normalized Site Electricity (kWh)")
plt.xlabel("Built Year Group")
plt.ylabel("Weather Normalized Site Electricity (kWh)")
plt.tight_layout()
plt.show()
No description has been provided for this image

Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶

Borough vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶

In [51]:
emission_data = filtered_df.groupby('Borough')[
    ['Total GHG Emissions (Metric Tons CO2e)', 
     'Direct GHG Emissions (Metric Tons CO2e)', 
     'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()


emission_data_melted = emission_data.melt(
    id_vars='Borough', 
    var_name='GHG Emissions', 
    value_name='Metric Tons CO2e'
)


plt.figure(figsize=(16, 14))
sns.barplot(
    x='Borough', 
    y='Metric Tons CO2e', 
    hue='GHG Emissions',  
    data=emission_data_melted, 
    palette='viridis'
)


plt.title('GHG Emissions by Borough (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

The bar chart reveals that Manhattan has the highest total GHG emissions among the boroughs, likely due to its dense population and high concentration of commercial buildings.

Largest Property Use Type Vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶

In [52]:
emission_data = filtered_df1.groupby('Largest Property Use Type')[
    ['Total GHG Emissions (Metric Tons CO2e)', 
     'Direct GHG Emissions (Metric Tons CO2e)', 
     'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()


emission_data_melted = emission_data.melt(
    id_vars='Largest Property Use Type', 
    var_name='GHG Emissions', 
    value_name='Metric Tons CO2e'
)

plt.figure(figsize=(20, 16))
sns.barplot(
    x='Largest Property Use Type', 
    y='Metric Tons CO2e', 
    hue='GHG Emissions',  
    data=emission_data_melted, 
    palette='viridis'
)


plt.title('GHG Emissions by Largest Property Use Type (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.xticks(rotation=45)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

The chart shows that hotels have the highest total GHG emissions among all property types, primarily due to their high energy consumption from HVAC systems and lighting.

Built Year Group Vs Total GHG Emissions & Direct GHG Emissions & Indirect GHG Emissions (Metric Tons CO2e)¶

In [53]:
emission_data = filtered_df1.groupby('Built Year Group')[
    ['Total GHG Emissions (Metric Tons CO2e)', 
     'Direct GHG Emissions (Metric Tons CO2e)', 
     'Indirect GHG Emissions (Metric Tons CO2e)']
].mean().reset_index()


emission_data_melted = emission_data.melt(
    id_vars='Built Year Group', 
    var_name='GHG Emissions', 
    value_name='Metric Tons CO2e'
)

plt.figure(figsize=(20, 16))
sns.lineplot(
    x='Built Year Group', 
    y='Metric Tons CO2e', 
    hue='GHG Emissions',  
    data=emission_data_melted, 
    palette='viridis'
)


plt.title('GHG Emissions by Largest Property Use Type (Metric Tons CO2e)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Metric Tons CO2e', fontsize=14)
plt.legend(title='Emission Type', fontsize=12, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image

Total GHG Emissions peaked in buildings constructed between 1950-2000, which could be attributed to the construction boom and the prevalent use of energy-intensive materials and systems during that period.

Water Use (All Water Sources) (kgal)¶

Borough vs Water Use (All Water Sources) (kgal)¶

In [54]:
plt.figure(figsize=(16, 14))
sns.barplot(
    x='Borough', 
    y='Water Use (All Water Sources) (kgal)', 
    data=filtered_df, 
    palette='viridis',
    errorbar=None
)


plt.title('Borough vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Largest Property Use Type Vs Water Use (All Water Sources) (kgal)¶

In [55]:
plt.figure(figsize=(16, 14))
sns.barplot(
    x='Largest Property Use Type', 
    y='Water Use (All Water Sources) (kgal)', 
    data=filtered_df1, 
    palette='viridis',
    errorbar=None
)


plt.title('Largest Property Use Type vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

The bar chart reveals that hotels have the highest water use among all property types, likely due to their extensive guest services, including laundry, pools, and large kitchens.

Built Year Group Vs Water Use (All Water Sources) (kgal)¶

In [56]:
plt.figure(figsize=(16, 14))
sns.lineplot(
    x='Built Year Group', 
    y='Water Use (All Water Sources) (kgal)', 
    data=filtered_df1, 
    palette='viridis',
    errorbar=None
)

plt.title('Built Year Group vs Water Use (All Water Sources) (kgal)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Water Use (All Water Sources) (kgal)', fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶

Borough vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶

In [57]:
plt.figure(figsize=(16, 14))
sns.barplot(
    x='Borough', 
    y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', 
    data=filtered_df, 
    palette='viridis',
    errorbar=None
)


plt.title('Borough vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Largest Property Use Type vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶

In [58]:
plt.figure(figsize=(16, 14))
sns.barplot(
    x='Largest Property Use Type', 
    y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', 
    data=filtered_df1, 
    palette='viridis',
    errorbar=None
)


plt.title('Largest Property Use Type vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Largest Property Use Type', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

Senior Care Communities have the highest indoor water intensity. This is likely due to the extensive water usage required for services such as bathing, laundry, and general care for residents.

Built Year Group vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)¶

In [59]:
plt.figure(figsize=(16, 14))
sns.lineplot(
    x='Built Year Group', 
    y='Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', 
    data=filtered_df1,
    errorbar=None
)


plt.title('Built Year Group vs Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=16)
plt.xlabel('Built Year Group', fontsize=14)
plt.ylabel('Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)', fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

Exploratory Data Analysis¶

Checking Normality of Data¶

In [60]:
import statsmodels.api as sm
In [61]:
norm_cols = ['Site EUI (kBtu/ft²)', 'Weather Normalized Site EUI (kBtu/ft²)', 'Weather Normalized Site Electricity Intensity (kWh/ft²)', 'Weather Normalized Site Natural Gas Intensity (therms/ft²)', 'Source EUI (kBtu/ft²)', 'Weather Normalized Source EUI (kBtu/ft²)', 'Natural Gas Use (kBtu)', 'Weather Normalized Site Natural Gas Use (therms)', 'Electricity Use - Grid Purchase (kBtu)', 'Weather Normalized Site Electricity (kWh)', 'Total GHG Emissions (Metric Tons CO2e)', 'Direct GHG Emissions (Metric Tons CO2e)', 'Indirect GHG Emissions (Metric Tons CO2e)', 'DOF Property Floor Area (ft²)', 'Property GFA - Self-reported (ft²)', 'Water Use (All Water Sources) (kgal)']

num_cols = len(norm_cols)
cols_per_row = 3
rows = (num_cols // cols_per_row) + (num_cols % cols_per_row > 0)

fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, rows * 5))
axes = axes.flatten()


for i, col in enumerate(norm_cols):
    if col in filtered_df.columns:
        col_data = pd.to_numeric(filtered_df[col], errors='coerce').dropna()
        
        if not col_data.empty:
            sm.qqplot(col_data, line="s", ax=axes[i])
            axes[i].set_title(f"QQ Plot: {col}")
        else:
            axes[i].set_title(f"No Data: {col}")
    else:
        axes[i].set_title(f"Column Not Found: {col}")

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
No description has been provided for this image

ENERGY STAR Scores¶

In [62]:
from scipy.stats import kruskal
In [63]:
def perform_kruskal_test(df, group_col, target_col):
    grouped_data = df.groupby(group_col)[target_col].apply(list)
    
    if len(grouped_data) > 1:
        stat, p_value = kruskal(*grouped_data)
        print(f"Kruskal-Wallis Test for {target_col} by {group_col}")
        print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")
        if p_value < 0.05:
            print(f"Result: Significant difference exists between groups of {group_col}.\n")
        else:
            print(f"Result: No significant difference between groups of {group_col}.\n")
    else:
        print(f"Not enough valid groups for {group_col}.\n")

group_columns = ['Largest Property Use Type', 'Borough', 'Year Built']
target_column = 'ENERGY STAR Score'


filtered_df1[target_column] = pd.to_numeric(filtered_df1[target_column], errors='coerce')
filtered_df1 = filtered_df1.dropna(subset=[target_column])


for group_col in group_columns:
    perform_kruskal_test(filtered_df1, group_col, target_column)
Kruskal-Wallis Test for ENERGY STAR Score by Largest Property Use Type
Statistic: 304.3264, p-value: 0.0000
Result: Significant difference exists between groups of Largest Property Use Type.

Kruskal-Wallis Test for ENERGY STAR Score by Borough
Statistic: 99.6495, p-value: 0.0000
Result: Significant difference exists between groups of Borough.

Kruskal-Wallis Test for ENERGY STAR Score by Year Built
Statistic: 545.4667, p-value: 0.0000
Result: Significant difference exists between groups of Year Built.

GHG emissions¶

In [64]:
from scipy.stats import pearsonr
In [65]:
emission_columns = ['Total GHG Emissions (Metric Tons CO2e)', 
                    'Direct GHG Emissions (Metric Tons CO2e)', 
                    'Indirect GHG Emissions (Metric Tons CO2e)']

comp_columns = ["Year Built",
               "Weather Normalized Site Electricity Intensity (kWh/ft²)",
               "Weather Normalized Site EUI (kBtu/ft²)",
               "Weather Normalized Site Natural Gas Intensity (therms/ft²)",
               "Natural Gas Use (kBtu)",
               "Weather Normalized Site Electricity (kWh)",
               "DOF Property Floor Area (ft²)",
               "Water Use (All Water Sources) (kgal)"]

for col in emission_columns:
    for cl in comp_columns:
        valid_data = filtered_df1[[cl, col]].dropna()
        pearson_corr, pearson_p = pearsonr(valid_data[cl], valid_data[col])
        
        print(f"Correlation Test for {cl} vs {col}:")
        print(f"Pearson: Correlation={pearson_corr}, p-value={pearson_p}")
    
        if pearson_p < 0.05:
            print(f"The relationship between {cl} and the {col} is statistically significant.\n")
        else:
            print(f"The relationship between {cl} Built and the {col} is statistically not-significant.\n")
Correlation Test for Year Built vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.15259013034354285, p-value=4.838082561006758e-49
The relationship between Year Built and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.4856422948366791, p-value=0.0
The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.4102876267725338, p-value=1.62e-321
The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.029316198047107174, p-value=0.010068827538074178
The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Natural Gas Use (kBtu) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.52093512413176, p-value=0.0
The relationship between Natural Gas Use (kBtu) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity (kWh) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.8537397902299549, p-value=0.0
The relationship between Weather Normalized Site Electricity (kWh) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for DOF Property Floor Area (ft²) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.8430661787822864, p-value=0.0
The relationship between DOF Property Floor Area (ft²) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Water Use (All Water Sources) (kgal) vs Total GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.5878536737678786, p-value=0.0
The relationship between Water Use (All Water Sources) (kgal) and the Total GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Year Built vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.03726436492965085, p-value=0.00035280117110480163
The relationship between Year Built and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=-0.0885675251026727, p-value=4.9009147425238425e-17
The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.4280504166183924, p-value=0.0
The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.2750033377742449, p-value=9.540230181821916e-134
The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Natural Gas Use (kBtu) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.6670485914026987, p-value=0.0
The relationship between Natural Gas Use (kBtu) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity (kWh) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.2363697257738262, p-value=9.553896549835233e-114
The relationship between Weather Normalized Site Electricity (kWh) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for DOF Property Floor Area (ft²) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.41156033341204634, p-value=0.0
The relationship between DOF Property Floor Area (ft²) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Water Use (All Water Sources) (kgal) vs Direct GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.4849217766361128, p-value=0.0
The relationship between Water Use (All Water Sources) (kgal) and the Direct GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Year Built vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.18980566031154897, p-value=2.314870990698141e-75
The relationship between Year Built and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity Intensity (kWh/ft²) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.7147603927198645, p-value=0.0
The relationship between Weather Normalized Site Electricity Intensity (kWh/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site EUI (kBtu/ft²) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.21362846978777594, p-value=5.345671579890487e-83
The relationship between Weather Normalized Site EUI (kBtu/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Natural Gas Intensity (therms/ft²) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=-0.12746269213152003, p-value=2.8249315800464165e-29
The relationship between Weather Normalized Site Natural Gas Intensity (therms/ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Natural Gas Use (kBtu) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.3099632385025462, p-value=5.415716841659371e-181
The relationship between Natural Gas Use (kBtu) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Weather Normalized Site Electricity (kWh) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.9635466262650673, p-value=0.0
The relationship between Weather Normalized Site Electricity (kWh) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for DOF Property Floor Area (ft²) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.7665161539874037, p-value=0.0
The relationship between DOF Property Floor Area (ft²) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Correlation Test for Water Use (All Water Sources) (kgal) vs Indirect GHG Emissions (Metric Tons CO2e):
Pearson: Correlation=0.42686714753730404, p-value=3.375328044704652e-262
The relationship between Water Use (All Water Sources) (kgal) and the Indirect GHG Emissions (Metric Tons CO2e) is statistically significant.

Pearson Correlation Test: ENERGY STAR Score vs Site EUI¶

In [66]:
correlation_data = filtered_df[['ENERGY STAR Score', 'Site EUI (kBtu/ft²)']].dropna()

correlation, p_value = pearsonr(correlation_data['ENERGY STAR Score'], correlation_data['Site EUI (kBtu/ft²)'])
print("Pearson Correlation Test:")
print(f"Correlation coefficient: {correlation:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant correlation between ENERGY STAR Score and Site EUI.")
else:
    print("Fail to reject the null hypothesis: No significant correlation between ENERGY STAR Score and Site EUI.")
Pearson Correlation Test:
Correlation coefficient: -0.7414, p-value: 0.0000
Reject the null hypothesis: There is a significant correlation between ENERGY STAR Score and Site EUI.

Chi-Square Test: Primary Property Type vs Borough¶

In [67]:
from scipy.stats import chi2_contingency
In [68]:
pivot_table = filtered_df1.pivot_table(index='Primary Property Type - Self Selected', 
                                       columns='Borough', 
                                       aggfunc='size', 
                                       fill_value=0)

plt.figure(figsize=(15, 8))
sns.heatmap(pivot_table, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Property Types by Borough', fontsize=16)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Primary Property Type', fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

Multifamily Housing is the most common property type in Manhattan, with 2714 buildings, reflecting the high population density and residential demand in the borough.

In [69]:
contingency_table = pd.crosstab(filtered_df['Primary Property Type - Self Selected'], filtered_df['Borough'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print("Chi-Square Test:")
print(f"Chi-Square Statistic: {chi2:.4f}, p-value: {p_value:.4f}, Degrees of Freedom: {dof}")

if p_value < 0.05:
    print("Reject the null hypothesis: There is an association between Primary Property Type and Borough.")
else:
    print("Fail to reject the null hypothesis: No association between Primary Property Type and Borough.")
Chi-Square Test:
Chi-Square Statistic: 2249.3423, p-value: 0.0000, Degrees of Freedom: 204
Reject the null hypothesis: There is an association between Primary Property Type and Borough.

Insights¶

Energy Usage and Efficiency¶

  1. Energy Efficiency by Borough: Manhattan displays higher energy efficiency, possibly due to modernized infrastructure.

  2. Weather-Normalized EUI: Weather-normalized Site EUI shows less variability, highlighting standardization benefits in energy reporting.

  3. Primary Property Type Impact: Commercial properties have higher Site EUIs compared to residential properties, indicating higher energy demands.

  4. Built year: Older buildings before 1950 had lower energy intensity. Buildings from 1950-2000 have higher energy use, likely due to inefficient systems and increased energy demand. Post-2000 improvements showcase a strong impact of modern energy-saving measures, sustainable designs, and policies.

GHG Emissions¶

  1. GHG Emissions Trends: Indirect GHG emissions contribute significantly compared to direct emissions, suggesting electricity reliance.

  2. Borough GHG Variance: Manhattan and Brooklyn have higher total GHG emissions due to higher property density.

Building Age Analysis¶

  1. Old vs. New Buildings: Older buildings exhibit higher energy and water usage, showing inefficiencies in older technologies.

  2. Modern Properties: Newer buildings have lower GHG emissions, aligning with stricter building codes and energy-efficient designs.

Water and Gas Usage¶

  1. Water Usage: Water consumption trends align with property size and type, with larger properties showing higher water use.

  2. Natural Gas Usage: Larger commercial buildings exhibit a spike in natural gas usage.

Correlations and Trends¶

  1. Correlation Between Variables: Energy intensity correlates positively with GHG emissions but shows weak correlation with property floor area.

  2. Energy Star Score: Buildings with higher Energy Star scores show significantly reduced emissions, validating score relevance.

Grouped Comparisons¶

  1. Borough-Property Type Pairing: Combining Borough and Primary Property Type reveals hotspots for energy inefficiency, especially in older commercial buildings.

  2. Built Year Grouping: Older buildings grouped by built year consistently show higher EUI and emissions metrics.

Hypothesis Testing¶

  1. Kruskal-Wallis Test: Significant differences in GHG emissions among boroughs suggest location-based energy policies might be effective.

  2. Comparison by Property Type: Hypothesis testing confirms energy performance varies significantly by property type, emphasizing tailored efficiency programs.